Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: implement multi-stage dvcfile #3584

Merged
merged 1 commit into from
Apr 17, 2020

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Apr 2, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

This PR adds -n flag (hidden at the moment), and will create a multi-stage dvcfile.
Eg:

$ dvc run -n stage1 -d foo -o bar "cat foo foo > bar"

For addressing the given stage:

$ dvc repro Dvcfile:stage1
$ dvc pipeline show --ascii :stage1  # assumes Dvcfile automatically

Multistage file format:

stages:
  stage1:
    cmd: cat foo foo > bar
    deps:
    - foo
    outs_no_cache:
    - bar
   outs:
   wdir:
   ...

Lockfile format

{
  "stage1": {
    "cmd": "cat foo foo > bar",
    "deps": {
      "foo": "d3b07384d113edec49eaa6238ad5ff00"
    },
    "outs": {
      "bar": "5fb7ba7e8447a836e774b66155f5776a"
    }
  },
  "stage2": {
    "cmd": "cat lorem lorem > ipsum",
    "deps": {
      "lorem": "f737a087bca81f69a6048ec744c73e41"
    },
    "outs": {
      "ipsum": "bfb52667a9d6a89c7796332adb1da3cd"
    }
  }
}

Closes #3606

@skshetry skshetry linked an issue Apr 3, 2020 that may be closed by this pull request
@skshetry skshetry force-pushed the multistage-dvcfile branch 2 times, most recently from 70a21f6 to 5866e4c Compare April 9, 2020 10:09
dvc/stage.py Outdated Show resolved Hide resolved
dvc/repo/run.py Outdated Show resolved Hide resolved
@skshetry skshetry force-pushed the multistage-dvcfile branch from 6031b1d to a75c351 Compare April 13, 2020 11:18
@skshetry skshetry marked this pull request as ready for review April 13, 2020 11:18
dvc/stage/__init__.py Outdated Show resolved Hide resolved
dvc/schema.py Outdated Show resolved Hide resolved
dvc/command/run.py Outdated Show resolved Hide resolved
dvc/dvcfile.py Outdated Show resolved Hide resolved
dvc/dvcfile.py Show resolved Hide resolved
dvc/lockfile.py Outdated Show resolved Hide resolved
dvc/repo/__init__.py Outdated Show resolved Hide resolved
dvc/repo/run.py Outdated Show resolved Hide resolved
dvc/repo/run.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing that you've managed to preserve backward compatibility πŸ₯‡ I gave it a try and it works pretty nicely! I also like that the commands stay with the old behavior for now, so even though we clearly have lots to clean up here, I would be fine merging this if you fix the biggest issues (biggest in your own opinion as well) for the sake of speed and would just move on with more fine cleaning after (there is clearly a lot of stuff we could make better here).

Now, pipeline stage files can house multiple stages, and separate lockfiles
are created which has the checksums, whereas Dvcfile will be clean and human
readable and editable. The *.dvc files will be generated for output files.

It is available via a hidden flag: -n.

Fixes iterative#1871
PR: iterative#3584
@skshetry skshetry force-pushed the multistage-dvcfile branch from a875e70 to f309b79 Compare April 17, 2020 14:49
@skshetry skshetry changed the title [WIP] dvc: implement multi-stage dvcfile dvc: implement multi-stage dvcfile Apr 17, 2020
@skshetry skshetry requested a review from efiop April 17, 2020 14:50
@skshetry
Copy link
Member Author

@efiop, this should be good to merge. Please take a look (I have fixed all except 1).

@skshetry skshetry self-assigned this Apr 17, 2020
@skshetry skshetry added p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension labels Apr 17, 2020
@efiop
Copy link
Contributor

efiop commented Apr 17, 2020

@skshetry Thanks! πŸ™ Let's run with it. Merging for now, will play around with the codebase some more while adding build-cache support.

@efiop efiop merged commit 128fa7f into iterative:master Apr 17, 2020
@tall-josh
Copy link

Looking forward to having a go at this! I've been hacking together a script that parses a params.yaml to form richer dvc run commands. (ie: adding -o, -m flags etc). This looks like it will do what I need, but in a more complete way.

@tall-josh
Copy link

I just realized multi-stage doesn't seem to support --params deps. Is this on the radar? Or, am I missing something?

@skshetry
Copy link
Member Author

skshetry commented Apr 23, 2020

Hi, @tall-josh. We are still iterating on this issue, and params was left out from the first iteration, because we are still discussing on few key things like:

  1. We create a separate stage files for the outputs. Do users care about this separation or is just a clutter we should avoid?
  2. Do we need lockfile or should that be similar to current *.dvc files that can support multiple stages?
  3. Should that multistage Dvcfile or lockfile be human readable/editable? Should that be multi-line json or yaml?

Looks like you dived into this, maybe you have some inputs?

P.S. And, yes, params will be supported in the next iteration (I don't think we'll have any new release before the support for multi-stage dvcfile).

@tall-josh
Copy link

Thanks @skshetry. My hope was to use the multi-stage files as a way to declaratively put together a pipeline. I'm not sure if this was the intention, but that goal in mind:

  1. Decluttering isn't my driving force, its making things declarative. Personally I create a separate dvc directory to contain the clutter generated by *.dvc files.
  2. I like the idea of the lock file being separate as it would keep the multi-stage file neater. Intuitively I feel like there should be one lock file per multi-stage file. I feel like that's easier for new comers to build a mental model around what's happening, but I don't have a strong opinion either way really.
  3. I would prefer multi-line yaml. Again, it just helps make sense of what's going on behind the scenes.

We have another discussion on #3633 about referencing variables inside dvc (single and multi-stage) files that may be helpful too. These two features would go a long way to my declarative dvc dream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc metrics add: should it dvc add untracked files? store whole DAG in one DVC-file
3 participants